
Randomized Exploration


Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Neural Information Processing Systems

These algorithms inject (carefully tuned) random noise into the value function to encourage exploration. UCB-type algorithms enjoy well-established theoretical guarantees but are difficult to implement, since an upper confidence bound is usually infeasible to compute for many practical models such as neural networks. Practitioners instead prefer randomized exploration, such as the noisy networks of [19], and algorithms with randomized exploration have been widely used in practice [37, 13, 11, 35].
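To make the idea concrete, here is a minimal tabular sketch of randomized value-function exploration: instead of adding a UCB bonus, the agent perturbs its Q-value estimates with random noise before acting greedily. The noise schedule and all names are illustrative assumptions, not any one paper's algorithm.

```python
import numpy as np

def randomized_greedy_action(Q, counts, state, sigma0=1.0, rng=None):
    """Pick an action greedily w.r.t. noise-perturbed Q-values.

    Q:      (S, A) array of estimated action values.
    counts: (S, A) array of visit counts; noise shrinks as counts grow.
    """
    rng = rng if rng is not None else np.random.default_rng()
    # Noise scale ~ 1/sqrt(n): more noise (hence more exploration) for
    # rarely tried actions, vanishing noise for well-estimated ones.
    scale = sigma0 / np.sqrt(np.maximum(counts[state], 1))
    noisy_q = Q[state] + rng.normal(0.0, scale)
    return int(np.argmax(noisy_q))
```

Because the perturbation only requires sampling noise rather than computing a confidence set, the same recipe carries over to settings where UCB bonuses are intractable.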


Randomized Exploration in Cooperative Multi-Agent Reinforcement Learning

Hsu, Hao-Lun, Wang, Weixin, Pajic, Miroslav, Xu, Pan

Neural Information Processing Systems

We present the first study on provably efficient randomized exploration in cooperative multi-agent reinforcement learning (MARL). We propose a unified algorithm framework for randomized exploration in parallel Markov Decision Processes (MDPs), and two Thompson Sampling (TS)-type algorithms, CoopTS-PHE and CoopTS-LMC, incorporating the perturbed-history exploration (PHE) strategy and the Langevin Monte Carlo exploration (LMC) strategy respectively, which are flexible in design and easy to implement in practice. For a special class of parallel MDPs where the transition is (approximately) linear, we theoretically prove that both CoopTS-PHE and CoopTS-LMC achieve a $\widetilde{\mathcal{O}}(d^{3/2}H^2\sqrt{MK})$ regret bound with communication complexity $\widetilde{\mathcal{O}}(dHM^2)$, where $d$ is the feature dimension, $H$ is the horizon length, $M$ is the number of agents, and $K$ is the number of episodes. This is the first theoretical result for randomized exploration in cooperative MARL. We evaluate our proposed method on multiple parallel RL environments, including a deep exploration problem (i.e., $N$-chain), a video game, and a real-world problem in energy systems. Our experimental results show that our framework achieves better performance, even under conditions of misspecified transition models. Additionally, we establish a connection between our unified framework and the practical application of federated learning.
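To make the two exploration strategies concrete, below is a minimal single-agent sketch of a PHE-style update and an LMC-style update for a linear model $y \approx \phi(s,a)^\top \theta$. The hyperparameters and function names are illustrative assumptions, not the CoopTS-PHE/CoopTS-LMC algorithms themselves.

```python
import numpy as np

def phe_estimate(Phi, y, sigma=1.0, lam=1.0, rng=None):
    """Perturbed-history exploration: refit ridge regression on targets
    perturbed with fresh Gaussian noise, so the greedy policy w.r.t. the
    resulting theta is itself randomized."""
    rng = rng if rng is not None else np.random.default_rng()
    y_pert = y + rng.normal(0.0, sigma, size=y.shape)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y_pert)

def lmc_step(theta, Phi, y, eta=1e-3, lam=1.0, rng=None):
    """One Langevin Monte Carlo step: noisy gradient descent on the ridge
    loss, so theta behaves like an approximate posterior sample rather
    than a point estimate."""
    rng = rng if rng is not None else np.random.default_rng()
    grad = Phi.T @ (Phi @ theta - y) + lam * theta
    return theta - eta * grad + np.sqrt(2.0 * eta) * rng.normal(size=theta.shape)
```

Both updates randomize the estimator itself rather than adding an explicit bonus, which is what makes them easy to implement with general function approximation.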






Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

Neural Information Processing Systems

We study reinforcement learning with multinomial logistic (MNL) function approximation, where the underlying transition probability kernel of the Markov decision process (MDP) is parametrized by an unknown transition core with features of state and action. For the finite-horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration that have frequentist regret guarantees. In the regret bounds, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of \texttt{RRL-MNL}, its regret bound scales with $\kappa^{-1}$, which can be prohibitively large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose \texttt{ORRL-MNL}, which estimates the value function using local gradient information of the MNL transition model.
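For intuition, here is a minimal sketch of how an MNL transition model assigns next-state probabilities: each candidate next state gets a score linear in a shared transition core $\theta$, and the scores are passed through a softmax. Feature shapes and names are illustrative assumptions.

```python
import numpy as np

def mnl_transition_probs(features, theta):
    """features: (N, d) array, one row phi(s, a, s') per candidate next
    state s'; theta: (d,) unknown transition core. Returns a length-N
    probability vector p(s' | s, a) via the multinomial logistic link."""
    scores = features @ theta
    scores -= scores.max()        # subtract max for numerical stability
    w = np.exp(scores)
    return w / w.sum()
```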




Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

Xiong, Zhihan, Shen, Ruoqi, Cui, Qiwen, Fazel, Maryam, Du, Simon S.

arXiv.org Artificial Intelligence

We study algorithms that use randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions and, for the first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure that both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
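A minimal tabular sketch of the two ingredients highlighted above: one shared Gaussian seed per episode, and a Bernstein-type noise magnitude that scales with the empirical variance of next-state values. The constants are illustrative assumptions, and simple range clipping stands in for the paper's more careful clipping operation.

```python
import numpy as np

def perturbed_q(r_hat, P_hat, V_next, counts, xi, c1=1.0, c2=1.0, H=10):
    """r_hat: (S, A) empirical rewards; P_hat: (S, A, S) empirical
    transitions; V_next: (S,) value estimate at the next step;
    counts: (S, A) visit counts; xi: a single N(0,1) seed shared
    across all (s, a) pairs in the episode."""
    n = np.maximum(counts, 1)
    mean_v = P_hat @ V_next                              # E_hat[V(s')]
    var_v = np.maximum(P_hat @ V_next**2 - mean_v**2, 0) # Var_hat[V(s')]
    # Bernstein-type magnitude: variance term plus a lower-order 1/n term.
    sigma = c1 * np.sqrt(var_v / n) + c2 * H / n
    q = r_hat + mean_v + sigma * xi                      # same seed everywhere
    return np.clip(q, 0.0, H)                            # keep values in range
```

Sharing the single seed `xi` across states correlates the perturbations within an episode, which is what lets the noise act like a coherent optimistic (or pessimistic) shift rather than independent per-state jitter.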